Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster-launch-installer-e2e: Restore EXIT and TERM signal handlers #1957

Merged

Conversation

wking
Copy link
Member

@wking wking commented Oct 17, 2018

I'd removed these in 3f2f01c (#1677) while removing the shell from the setup container. But we got a shell back in b20d571 (#1761), and restoring these handlers will help:

  • Trigger early pod wrap-up when the setup container dies. Both the test and teardown containers are monitoring the 'exit' file so they can bail early.

  • Forward TERM to the installer, so we can exit gracefully when someone TERMs the shell script.

For the signal handling to work, we need to keep the shell process around, so I've dropped the exec from the openshift-install invocation and moved it to the background instead (so jobs -p will list it).

Supersedes #1956
CC @smarterclayton @ironcladlou

I'd removed these in 3f2f01c
(ci-operator/config/openshift/installer/master: Move to
openshift-install, 2018-09-26, openshift#1677) while removing the shell from
the setup container.  But we got a shell back in b20d571
(cluster-launch-installer-e2e: Start setting expirationDate again,
2018-09-29, openshift#1761), and restoring these handlers will help:

* Trigger early pod wrap-up when the setup container dies.  Both the
  test and teardown containers are monitoring the 'exit' file so they
  can bail early.

* Forward TERM to the installer, so we can exit gracefully when
  someone TERMs the shell script.

For the signal handling to work, we need to keep the shell process
around, so I've dropped the 'exec' from the openshift-install
invocation and moved it to the background instead (so 'jobs -p' will
list it).
@openshift-ci-robot openshift-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 17, 2018
@abhinavdahiya
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 17, 2018
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 17d2b89 into openshift:master Oct 17, 2018
@openshift-ci-robot
Copy link
Contributor

@wking: Updated the prow-job-cluster-launch-installer-e2e configmap using the following files:

  • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml

In response to this:

I'd removed these in 3f2f01c (#1677) while removing the shell from the setup container. But we got a shell back in b20d571 (#1761), and restoring these handlers will help:

  • Trigger early pod wrap-up when the setup container dies. Both the test and teardown containers are monitoring the 'exit' file so they can bail early.

  • Forward TERM to the installer, so we can exit gracefully when someone TERMs the shell script.

For the signal handling to work, we need to keep the shell process around, so I've dropped the exec from the openshift-install invocation and moved it to the background instead (so jobs -p will list it).

Supersedes #1956
CC @smarterclayton @ironcladlou

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the installer-setup-signal-handlers branch October 18, 2018 20:21
wking added a commit to wking/openshift-release that referenced this pull request Oct 26, 2018
From the POSIX spec [1]:

  If the wait utility is invoked with no operands, it shall wait until
  all process IDs known to the invoking shell have terminated and exit
  with a zero exit status.

  If one or more pid operands are specified that represent known
  process IDs, the wait utility shall wait until all of them have
  terminated. If one or more pid operands are specified that represent
  unknown process IDs, wait shall treat them as if they were known
  process IDs that exited with exit status 127. The exit status
  returned by the wait utility shall be the exit status of the process
  requested by the last pid operand.

Because the bare 'wait' from d862f6f (cluster-launch-installer-e2e:
Restore EXIT and TERM signal handlers, 2018-10-17, openshift#1957) contained no
PID arguments, it was returning zero even when the install command
failed.  Here's a simple example demonstrating the effect:

  $ false &
  $ wait
  $ echo $?
  0

With this command, we use $! to get the installer's process PID [2],
and pass that to 'wait'.  Now the wait call will exit with the
installer exit status, and because it's the last command in the shell
script, the script will also exit with the installer exit status.
Here's a simple example demonstrating the new approach:

  $ false &
  $ wait "$!"
  $ echo $?
  1

[1]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/wait.html
[2]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_05_02
wking added a commit to wking/openshift-release that referenced this pull request Jan 24, 2019
…te cluster'

Unwinding part of 7e56379 (templates/openshift: grab bootstrap log on
failure, 2019-01-14, openshift#2633).  I'd initially added the backgrounding in
d862f6f (cluster-launch-installer-e2e: Restore EXIT and TERM signal
handlers, 2018-10-17, openshift#1957).  If you exec 'create cluster', it kills
the traps, but I was less clear on whether the background/wait dance
was required vs. the approach taken by 7e56379.  Here are scripts
comparing the two approaches vs. SIGTERM (e.g. if a test run is
aborted because new updates have been pushed to a pull request):

  $ cat test-sleep.sh
  #!/bin/sh
  trap 'rc=$?; if test "${rc}" -eq 0; then echo touch /tmp/setup-success; else echo touch /tmp/exit; fi; exit "${rc}"' EXIT
  trap 'CHILDREN=$(jobs -p); if test -n "${CHILDREN}"; then echo kill ${CHILDREN} && kill ${CHILDREN}; fi' TERM
  echo $$
  python -c 'import time; time.sleep(20); print("done sleeping")'
  $ ./test-sleep.sh  # while this is running, in another shell: kill 17125
  17125
  done sleeping
  touch /tmp/setup-success

That's not good, the Python process should have been killed before it
got far enough in to print "done sleeping".  Here's with a
background/wait:

  $ cat test-sleep-wait.sh
  #!/bin/sh
  trap 'rc=$?; if test "${rc}" -eq 0; then echo touch /tmp/setup-success; else echo touch /tmp/exit; fi; exit "${rc}"' EXIT
  trap 'CHILDREN=$(jobs -p); if test -n "${CHILDREN}"; then echo kill ${CHILDREN} && kill ${CHILDREN}; fi' TERM
  echo $$
  python -c 'import time; time.sleep(20); print("done sleeping")' &
  wait "$!"
  $ ./test-sleep-wait.sh  # kill 17242
  17242
  kill 17243
  touch /tmp/exit

that looks better.  Stracing to put a finger on the differences
(e.g. 'strace -tfo /tmp/trace ./test-sleep.sh'), gives:

  21243 21:59:18 execve("./test-sleep.sh", ["./test-sleep.sh"], [/* 74 vars */]) = 0
  ...
  21243 21:59:18 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fe60e216a10) = 21244
  ...
  21244 21:59:18 execve("/usr/bin/python", ["python", "-c", "import time; time.sleep(20); pri"...], [/* 73 vars */]) = 0
  ...
  21244 21:59:18 select(0, NULL, NULL, NULL, {20, 0} <unfinished ...>
  21243 21:59:25 <... wait4 resumed> 0x7ffe9feef350, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  21243 21:59:25 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=8603, si_uid=114032} ---
  ...
  21244 21:59:38 write(1, "done sleeping\n", 14) = 14
  21244 21:59:38 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f20f7b88680}, {0x7f20f7eab680, [], SA_RESTORER, 0x7f20f7b88680}, 8) = 0
  21244 21:59:38 exit_group(0)            = ?
  21244 21:59:38 +++ exited with 0 +++
  21243 21:59:38 <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 21244
  ...
  21243 21:59:38 pipe([3, 4])             = 0
  ...
  21243 21:59:38 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fe60e216a10) = 21262
  21243 21:59:38 rt_sigprocmask(SIG_SETMASK, [TERM], NULL, 8) = 0
  21243 21:59:38 rt_sigaction(SIGCHLD, {0x441390, [], SA_RESTORER|SA_RESTART, 0x7fe60d850280},  <unfinished ...>
  21262 21:59:38 close(255 <unfinished ...>
  ...
  21243 21:59:38 read(3,  <unfinished ...>
  ...
  21262 21:59:38 close(4)                 = 0
  ...
  21262 21:59:38 exit_group(0)            = ?
  21243 21:59:38 <... read resumed> "", 128) = 0
  21262 21:59:38 +++ exited with 0 +++
  21243 21:59:38 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=21262, si_uid=114032, si_status=0, si_utime=0, si_stime=0} ---
  21243 21:59:38 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 21262
  ...
  21243 21:59:38 write(1, "touch /tmp/setup-success\n", 25) = 25
  21243 21:59:38 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
  21243 21:59:38 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
  21243 21:59:38 exit_group(0)            = ?
  21243 21:59:38 +++ exited with 0 +++

so it looks like the shell (21243) just swallows the TERM, waits out
the sleeping Python process, and exits zero.  21262 is the 'jobs'
process, which doesn't fire until Python finishes its sleep and
returns.  The trap is a no-op, because test-sleep.sh never spawns
backgrounded children, so 'jobs' returns an empty string.

And here's an strace on test-sleep-wait.sh:

  21638 22:04:48 execve("./test-sleep-wait.sh", ["./test-sleep-wait.sh"], [/* 74 vars */]) = 0
  ...
  21638 22:04:48 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f78dc7c4a10) = 21639
  ...
  21639 22:04:48 execve("/usr/bin/python", ["python", "-c", "import time; time.sleep(20); pri"...], [/* 73 vars */]) = 0
  ...
  21639 22:04:48 select(0, NULL, NULL, NULL, {20, 0} <unfinished ...>
  21638 22:04:54 <... wait4 resumed> 0x7ffc07d607a0, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  21638 22:04:54 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=8603, si_uid=114032} ---
  ...
  21638 22:04:54 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f78dc7c4a10) = 21645
  ...
  21645 22:04:54 write(1, "21639\n", 6)   = 6
  ...
  21645 22:04:54 exit_group(0)            = ?
  21645 22:04:54 +++ exited with 0 +++
  21638 22:04:54 <... read resumed> "", 128) = 0
  21638 22:04:54 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=21645, si_uid=114032, si_status=0, si_utime=0, si_stime=0} ---
  21638 22:04:54 rt_sigreturn({mask=[TERM]}) = 0
  ...
  21638 22:04:54 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 21645
  ...
  21638 22:04:54 write(1, "kill 21639\n", 11) = 11
  21638 22:04:54 kill(21639, SIGTERM)     = 0
  ...
  21638 22:04:54 write(1, "touch /tmp/exit\n", 16 <unfinished ...>
  21639 22:04:54 +++ killed by SIGTERM +++
  21638 22:04:54 <... write resumed> )    = 16
  21638 22:04:54 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=21639, si_uid=114032, si_status=SIGTERM, si_utime=0, si_stime=0} ---
  21638 22:04:54 wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], WNOHANG, NULL) = 21639
  21638 22:04:54 wait4(-1, 0x7ffc07d60010, WNOHANG, NULL) = -1 ECHILD (No child processes)
  21638 22:04:54 rt_sigreturn({mask=[]})  = 16
  21638 22:04:54 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
  21638 22:04:54 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
  21638 22:04:54 exit_group(143)          = ?
  21638 22:04:54 +++ exited with 143 +++

With this approach, we see the shell (21638) catch the TERM and
immediately launch the trap handler which launches 'jobs' (21645).
The trap handler sends a new TERM to the Python process (21639), which
exits non-zero.  The shell reaps the 'jobs' process, echos "touch
/tmp/exit", reaps the killed Python process, and exits 143.  From the
ordering there it looks like the shell isn't actually waiting for the
Python process, although we could add a wait to the TERM handler if we
wanted.  Let's do that, and also drop some of the debugging echos now
that we're committed to strace:

  $ cat test-sleep-wait-in-handler.sh
  #!/bin/sh
  trap 'rc=$?; if test "${rc}" -eq 0; then touch /tmp/success; else touch /tmp/exit; fi; exit "${rc}"' EXIT
  trap 'CHILDREN=$(jobs -p); if test -n "${CHILDREN}"; then kill ${CHILDREN} && wait; fi' TERM
  echo $$
  python -c 'import time; time.sleep(20); print("done sleeping")' &
  wait "$!"
  $ strace -tfo /tmp/trace ./test-sleep-wait-in-handler.sh
  23095

As above, I then 'kill 23095' from another terminal.  Excerpts from
the trace:

  23095 22:27:03 execve("./test-sleep-wait-in-handler.sh", ["./test-sleep-wait-in-handler.sh"], [/* 74 vars */]) = 0
  ...
  23095 22:27:03 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f8e464f2a10) = 23096
  ...
  23096 22:27:03 execve("/usr/bin/python", ["python", "-c", "import time; time.sleep(20); pri"...], [/* 73 vars */]) = 0
  ...
  23096 22:27:03 select(0, NULL, NULL, NULL, {20, 0} <unfinished ...>
  23095 22:27:09 <... wait4 resumed> 0x7ffe54e18750, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  23095 22:27:09 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=8603, si_uid=114032} ---
  ...
  23095 22:27:09 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f8e464f2a10) = 23110
  ...
  23110 22:27:09 write(1, "23096\n", 6)   = 6
  23095 22:27:09 <... read resumed> "23096\n", 128) = 6
  ...
  23110 22:27:09 +++ exited with 0 +++
  23095 22:27:09 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=23110, si_uid=114032, si_status=0, si_utime=0, si_stime=0} ---
  ...
  23095 22:27:09 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 23110
  ...
  23095 22:27:09 kill(23096, SIGTERM)     = 0
  23096 22:27:09 <... select resumed> )   = ? ERESTARTNOHAND (To be restarted if no handler)
  23095 22:27:09 rt_sigprocmask(SIG_BLOCK, NULL,  <unfinished ...>
  23096 22:27:09 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=23095, si_uid=114032} ---
  ...
  23095 22:27:09 wait4(-1,  <unfinished ...>
  23096 22:27:09 +++ killed by SIGTERM +++
  23095 22:27:09 <... wait4 resumed> [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], 0, NULL) = 23096
  ...
  23095 22:27:09 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=23096, si_uid=114032, si_status=SIGTERM, si_utime=1, si_stime=0} ---
  23095 22:27:09 wait4(-1, 0x7ffe54e17810, WNOHANG, NULL) = -1 ECHILD (No child processes)
  ...
  23095 22:27:09 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f8e464f2a10) = 23111
  ...
  23111 22:27:09 execve("/usr/bin/touch", ["touch", "/tmp/exit"], [/* 73 vars */]) = 0
  ...
  23111 22:27:09 open("/tmp/exit", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 3
  ...
  23111 22:27:09 exit_group(0)            = ?
  23111 22:27:09 +++ exited with 0 +++
  23095 22:27:09 <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 23111
  23095 22:27:09 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
  23095 22:27:09 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=23111, si_uid=114032, si_status=0, si_utime=0, si_stime=0} ---
  23095 22:27:09 wait4(-1, 0x7ffe54e18050, WNOHANG, NULL) = -1 ECHILD (No child processes)
  ...
  23095 22:27:09 exit_group(143)          = ?
  23095 22:27:09 +++ exited with 143 +++

With this approach, we see the shell (23095) catch the TERM and
immediately launch the trap handler which launches 'jobs' (23110) and
reaps it.  The trap handler sends a new TERM to the Python process
(23096), which exits non-zero.  The shell reaps the killed Python
process, waits but gets no children, launches and reaps the touch
process (23111), and exits 143.  That looks like what we want, so I've
gone that way with this commit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants